Name: Mohammad Jawad Nayosh
Student ID: T01242238
Summary: This project aimed to predict median house values in California districts using various features. I explored the dataset, visualized geographical data, prepared the data for machine learning, trained several models, and evaluated their performance. The final visualization highlighted the geographical distribution of median house values and the predicted values generated by my chosen model.
Install Python packages
from google.colab import drive
drive.mount('/content/drive')
%cd "/content/drive/MyDrive/ANLY 6110_ II/Module 4 Training Model /M4P2 EtE Machine Learning Project"
%ls
%pwd
%pip install geopandas
%pip install contextily
%pip install mapclassify
Import Python packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import contextily as cx
import mapclassify as mc
housing = pd.read_csv("/content/drive/MyDrive/ANLY 6110_ II/Module 4 Training Model /M4P2 EtE Machine Learning Project/housing.csv")
us_gdf = gpd.read_file('/content/drive/MyDrive/ANLY 6110_ II/Module 4 Training Model /M4P2 EtE Machine Learning Project/map/cb_2023_us_state_500k.shp')
housing
This data includes metrics such as the population, median income, and median housing price for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). I will call them “districts” for short. The model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.
housing.shape
housing.info()
There are 20,640 instances in the dataset, which means that it is fairly small by machine learning standards, but it’s perfect to get started. We also notice that the total_bedrooms attribute has only 20,433 non-null values, meaning that 207 districts are missing this feature. We will need to take care of this later.
All attributes are numerical, except for ocean_proximity. Its type is object, so it could hold any kind of Python object. But since we loaded this data from a CSV file, we know that it must be a text attribute. When we looked at the top five rows, we noticed that the values in the ocean_proximity column were repetitive, which means that it is probably a categorical attribute. We can find out what categories exist and how many districts belong to each category by using the value_counts() method:
housing["ocean_proximity"].value_counts()
housing.describe()
housing.columns
new_table = housing[["population","median_income","ocean_proximity"]].copy()
new_table
c = housing["total_rooms"] # let us have a look at some data and check the room distribution
c
c = housing["ocean_proximity"]
vc = c.value_counts()
vc
fig = plt.figure(figsize= (8,6))
ax1 =plt.subplot(1,1,1)
plt.bar(vc.index, vc.values)
plt.show()
fig =plt.figure(figsize= (8,6))
ax1 =plt.subplot(1,1,1)
plt.scatter(housing['housing_median_age'], housing['population'], c=housing['population'], s=20, cmap='jet')
plt.colorbar()
plt.xlabel('housing_median_age')
plt.ylabel('population')
plt.show()
It looks like the majority of the population lives in houses that are 0-10 years old. The older the houses, the smaller the population living in them.
housing.columns
To check for correlation between attributes we can use the Pandas scatter_matrix() function, which plots every numerical attribute against every other numerical attribute. Since there are 9 numerical attributes, we would get 9² = 81 plots, which would not fit on a page, so we decided to focus on a few promising attributes that seem most correlated with the median housing value.
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
"housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
Looking at the correlation scatterplots, it seems like the most promising attribute to predict the median house value is the median income, so let us zoom in on its scatterplot:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
alpha=0.1, grid=True)
plt.show()
This plot reveals that the correlation is indeed quite strong; we can clearly see the upward trend, and the points are not too dispersed. Also, the price cap we noticed earlier is clearly visible as a horizontal line at $500,000. But the plot also reveals other less obvious straight lines: a horizontal line around $450,000, another around $350,000, perhaps one around $280,000, and a few more below that. We may want to try removing the corresponding districts to prevent our algorithms from learning to reproduce these data quirks.
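As a sketch of that last idea, removing the capped districts could look like the following (a minimal example on a toy DataFrame with hypothetical values; the real filter would run on the housing DataFrame, and the exact cap value, about $500,000, should be checked against the data first):

```python
import pandas as pd

# Toy stand-in for the housing data (hypothetical values)
df = pd.DataFrame({
    "median_house_value": [120000, 500001, 350000, 500001, 89000],
    "median_income": [2.5, 8.1, 4.7, 9.3, 1.9],
})

# Keep only districts priced below the apparent $500,000 cap,
# so the model never learns the artificial horizontal line
uncapped = df[df["median_house_value"] < 500000].copy()
print(len(uncapped))  # 3 of the 5 toy rows survive
```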
# Next, let us show our scatter plot on the map
us_gdf.head(5) #this is our map's data, let us change the index
us_gdf.set_index(['STUSPS'], drop=False, inplace=True) # drop=False keeps STUSPS as a column; inplace=True modifies us_gdf itself
us_gdf
fig =plt.figure(figsize= (8,6))
ax1 =plt.subplot(1,1,1)
plt.scatter(housing['longitude'],housing['latitude'], s=housing['population']/100, c=housing['median_house_value'], cmap='jet')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.colorbar()
plt.show()
#now lets plot the US map:
fig =plt.figure(figsize=(30,25))
ax1=plt.subplot(1,1,1)
us_gdf.boundary.plot(ax=ax1, color="black")
cx.add_basemap(ax=ax1,crs=us_gdf.crs,attribution="", source=cx.providers.OpenTopoMap)
plt.axis(False)
plt.show()
# Now we need to plot the CA map, for which we extract the CA data as follows:
ca_gdf = us_gdf.loc[["CA"]]
ca_gdf
fig =plt.figure(figsize=(10,8))
ax1=plt.subplot(1,1,1)
ca_gdf.boundary.plot(ax=ax1, color="black")
cx.add_basemap(ax=ax1,crs=ca_gdf.crs,attribution="", source=cx.providers.OpenTopoMap)
plt.scatter(housing['longitude'],housing['latitude'], s=housing['population']/100, c=housing['median_house_value'], cmap='jet', label='population')
plt.legend()
plt.colorbar()
plt.axis(False)
plt.show()
Next, let's clean the data (note that DataFrame.drop() and dropna() create a copy; they don't modify the original DataFrame unless you pass inplace=True):
# To clean the data, we can use the following strategies:
#Process the missing values:
#Drop the columns that include missing values (which is not a good idea, as it removes the whole column)
#Drop the rows (we will use this approach here)
#Fill the values with the mean or median (if we use this strategy, we need a valid justification for it)
housing.info()
housing.dropna(subset=["total_bedrooms"], axis = 0) #axis = 0 means we delete the rows with missing values. dropna() returns a new DataFrame; the original data still has missing values.
#if we want to drop the column with missing values, we set axis = 1
housing.drop(["total_bedrooms"], axis = 1)
#Fill the values with the mean or median, the third strategy
# let us fill the null values with the average
avg = housing["total_bedrooms"].mean()
avg
housing["total_bedrooms"].fillna(avg) # returns a new Series; the original column is unchanged
#for our project we use the first strategy (dropping the rows with null values)
housing.dropna(subset=["total_bedrooms"], axis = 0, inplace=True) # inplace=True applies the change to the original dataset
housing.info()
Now let's preprocess the categorical input feature, ocean_proximity, and prepare it for the machine learning model:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
housing["ocean_proximity"]
# As we can see, the ocean_proximity column is a string; we need to change it to integers using OrdinalEncoder or OneHotEncoder
# We use OrdinalEncoder when the categories have a natural order; otherwise we use OneHotEncoder. The ocean_proximity categories are not ordered, so OneHotEncoder is the right choice here, but we will demonstrate both
vc = housing["ocean_proximity"].value_counts()
vc
#OrdinalEncoder
ordinal_encoder = OrdinalEncoder() # Now let us ask it to learn from data
ordinal_encoder.fit(housing[["ocean_proximity"]])
#Let us see what is inside our encoder
ordinal_encoder.categories_
#Let us transform it; this encodes our categories as integers from 0 to 4
ordinal_encoder.transform(housing[["ocean_proximity"]])
#if we use OneHotEncoder we will get five columns, because we have five categories; in each column the value in each cell can be 0 or 1
#And we use this because our data is not in order
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(housing[["ocean_proximity"]])
X_P1= onehot_encoder.transform(housing[["ocean_proximity"]]).toarray()
X_P1
#By default, the OneHotEncoder class returns a sparse array, but we can convert it to a dense array if needed by calling the toarray() method
# so, we will use this encoder; let us save its output
onehot_encoder.get_feature_names_out()
# create a data frame
X_P1_df = pd.DataFrame(X_P1, columns=onehot_encoder.get_feature_names_out(), index=housing[["ocean_proximity"]].index)
X_P1_df
#Now let us combine our data frames into one
housing_df = pd.merge(left=housing, right=X_P1_df, left_index=True, right_index=True)
housing_df
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(housing_df, test_size=0.2, stratify= housing_df["ocean_proximity"])
train_df
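To see what the stratify argument does, here is a minimal sketch on a made-up frame: with stratify set, both splits keep the same category mix as the full data (the 70/30 ratio below is invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced categorical column (hypothetical data)
df = pd.DataFrame({
    "x": range(100),
    "ocean_proximity": ["INLAND"] * 70 + ["NEAR BAY"] * 30,
})

train, test = train_test_split(
    df, test_size=0.2, stratify=df["ocean_proximity"], random_state=42
)

# Both splits preserve the 70/30 category ratio of the full data
print(train["ocean_proximity"].value_counts(normalize=True))
print(test["ocean_proximity"].value_counts(normalize=True))
```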
# Here I will explain two methods: the min-max scaler and the standard scaler
#Some Rules:
#Scale: most of the time, we should scale on X (features).
#Learn (fit) from the train data.
#Apply transform to both train and test data.
# min max: (x-min)/(max-min) [0,1]
# standard: normal distribution (x-mean)/std_deviation [-3,3]
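As a quick sanity check of the two formulas above, here is what they compute on a toy array (plain NumPy, mirroring what MinMaxScaler and StandardScaler do internally):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # toy feature values

# min-max: (x - min) / (max - min) -> values land in [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())
print(minmax)  # [0.   0.25 0.5  0.75 1.  ]

# standard: (x - mean) / std -> most values fall within about [-3, 3]
standard = (x - x.mean()) / x.std()
print(standard)
```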
# let us specify our X_train data first; for this we do not consider the one-hot columns that we created before, we will just use the first 8 columns
train_df.columns
X_P1_columns = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
'total_bedrooms', 'population', 'households', 'median_income'] # median_house_value will be our y, so for the first part we consider the first 8 columns
X_train_P1 = train_df[X_P1_columns]
X_train_P1
# now let us specify the second part, the categorical part
X_P2_columns = ['ocean_proximity_<1H OCEAN',
'ocean_proximity_INLAND', 'ocean_proximity_ISLAND',
'ocean_proximity_NEAR BAY', 'ocean_proximity_NEAR OCEAN']
X_train_P2 = train_df[X_P2_columns]
X_train_P2
y_train = train_df["median_house_value"]
y_train
X_test = test_df[X_P1_columns]
X_test
X_test_P2 = test_df[X_P2_columns]
X_test_P2
y_test = test_df["median_house_value"]
y_test
#So, we have our values; now let us scale the features:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
minmax_scaler = MinMaxScaler()
minmax_scaler.fit(X_train_P1)
Now let's apply the fitted scaler to both the training and test features:
X_train_P1_scaled = minmax_scaler.transform(X_train_P1)
minmax_scaler.transform(X_test)
# As explained earlier, we are not going to use the min-max scaler. In this project we use the standard scaler
standard_scaler = StandardScaler()
standard_scaler.fit(X_train_P1)
# Because we use this method, let us apply and save it:
X_train_P1_scaled = standard_scaler.transform(X_train_P1)
X_train_P1_scaled
#and printing this we will see most of the values are between -3 and 3
X_test_P1_scaled = standard_scaler.transform(X_test)
#Now we are ready to select a machine learning model and train it.
#But before that we need to connect the features together, i.e., combine part 1 and part 2:
X_train = np.hstack([X_train_P1_scaled, X_train_P2.to_numpy()])
X_train.shape
X_test = np.hstack([X_test_P1_scaled, X_test_P2.to_numpy()])
X_test.shape
#Now we can create several models: here we create regression models because we need to predict the median house value, which is a continuous number
#First we will use linear regression:
from sklearn.linear_model import LinearRegression
l_m = LinearRegression(n_jobs=-1)
l_m.fit(X_train, y_train)
#After building the model, let us verify the results
y_test_lm_pred = l_m.predict(X_test)
y_test_lm_pred
test_df['linear'] = y_test_lm_pred/y_test-1
# we predicted the values; now how do we measure whether the model is good or bad? We use squared errors
from sklearn.metrics import mean_squared_error
lm_rmsc = mean_squared_error(y_true=y_test, y_pred=y_test_lm_pred) # note: this is the MSE, in squared units of the target
lm_rmsc
#let us calculate the error ratio:
y_test_lm_pred/y_test-1 # e.g. 0.14 means the prediction is 14% higher than the ground truth; negative means lower, positive means higher
#Now we can calculate the absolute value and its average; this gives a clearer idea about our model's error:
np.average(np.abs(y_test_lm_pred/y_test-1))
#So our linear model has about a 29% average error
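This average-absolute-ratio metric is essentially the mean absolute percentage error (MAPE). A minimal sketch on made-up numbers shows how it behaves:

```python
import numpy as np

# Hypothetical predictions vs. ground truth (not the project's real values)
y_true = np.array([100000.0, 200000.0, 400000.0])
y_pred = np.array([110000.0, 180000.0, 400000.0])

ratio_error = y_pred / y_true - 1       # signed: + means over-prediction
print(ratio_error)                      # [ 0.1 -0.1  0. ]

mape = np.average(np.abs(ratio_error))  # mean absolute percentage error
print(mape)                             # about 0.067, i.e. ~6.7% average error
```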
Let us interpret the MSE value of 4969943836.149107 in the context of our housing price prediction model.
Interpretation of MSE:
Magnitude: The MSE value is quite large. This suggests that, on average, the squared difference between our model's predicted housing prices and the actual housing prices is substantial.
Units: We must remember that MSE is expressed in squared units of our target variable; in this case, squared dollars (since we're predicting house values).
Desirability: Generally, a lower MSE is preferred, as it indicates better model accuracy. A very high MSE like this one suggests that the model's predictions could be significantly off from the actual values.
In conclusion: While the high MSE suggests the model is not performing optimally, it's not necessarily a dead end. By systematically exploring improvements in feature engineering, model selection, and hyperparameter tuning, we can likely achieve better predictive accuracy.
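One quick aid to interpretation: taking the square root of the MSE gives the RMSE, which is back on the dollar scale. A minimal sketch with made-up numbers (not the project's actual predictions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical house values in dollars
y_true = np.array([250000.0, 300000.0, 150000.0])
y_pred = np.array([240000.0, 320000.0, 160000.0])

mse = mean_squared_error(y_true, y_pred)  # squared dollars
rmse = np.sqrt(mse)                       # back on the dollar scale
print(mse, rmse)
```

Applied to the MSE reported above (about 4.97e9 squared dollars), the same square root gives an RMSE of roughly $70,500, i.e. a typical prediction error on the order of tens of thousands of dollars.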
#Let us build another model
from sklearn.tree import DecisionTreeRegressor
dt_m = DecisionTreeRegressor()
dt_m.fit(X_train, y_train)
y_test_tree_pred = dt_m.predict(X_test)
tree_rmsc = mean_squared_error(y_true=y_test, y_pred=y_test_tree_pred)
tree_rmsc
np.average(np.abs(y_test_tree_pred/y_test-1))
test_df['Decision Tree'] = y_test_tree_pred/y_test-1
# we can also build a KNeighborsRegressor
from sklearn.neighbors import KNeighborsRegressor
knn_m = KNeighborsRegressor()
knn_m.fit(X_train, y_train)
#Let's make predictions for this model:
y_test_knn_pred = knn_m.predict(X_test)
knn_rmsc = mean_squared_error(y_true=y_test, y_pred=y_test_knn_pred)
knn_rmsc
test_df['KNN'] = y_test_knn_pred/y_test-1
np.average(np.abs(y_test_knn_pred/y_test-1))
#Let us build a Random Forest model:
from sklearn.ensemble import RandomForestRegressor
rf_m = RandomForestRegressor()
rf_m.fit(X_train, y_train)
#Let us make predictions for this model:
y_test_rf_pred = rf_m.predict(X_test)
rf_rmsc = mean_squared_error(y_true=y_test, y_pred=y_test_rf_pred)
rf_rmsc
test_df['Random Forest'] = y_test_rf_pred/y_test-1
np.average(np.abs(y_test_rf_pred/y_test-1))
# The RF model looks much better, with the minimum error compared to all the other models we observed.
#Our last model: SVR
from sklearn.svm import SVR
svr_m = SVR()
svr_m.fit(X_train, y_train)
y_test_srvm_pred = svr_m.predict(X_test)
svrm_rmsc = mean_squared_error(y_true=y_test, y_pred=y_test_srvm_pred)
svrm_rmsc
test_df['SVR'] = y_test_srvm_pred/y_test-1
np.average(np.abs(y_test_srvm_pred/y_test-1))
Cross-validation is another way to check whether a model is good or bad, but it takes more time because it trains several models.
from sklearn.model_selection import cross_val_score
lm_C_rmsc = -cross_val_score(l_m, X_train, y_train, scoring="neg_mean_squared_error", cv=10) # we negate the result because scikit-learn reports the negative MSE ("neg_mean_squared_error")
lm_C_rmsc
lm_C_rmsc.mean()
test_df
#First we prepare our geopandas data for visualization
test_housing_gdf = gpd.GeoDataFrame(test_df, geometry=gpd.points_from_xy(test_df["longitude"], test_df["latitude"]), crs=ca_gdf.crs)
# let us copy our CA map and show our results on it
fig =plt.figure(figsize=(10,8))
ax1=plt.subplot(1,1,1)
ca_gdf.boundary.plot(ax=ax1, color="none")
cx.add_basemap(ax=ax1,crs=us_gdf.crs,attribution="", source=cx.providers.OpenStreetMap.Mapnik)
test_housing_gdf.plot(ax=ax1, markersize = test_housing_gdf["population"]/200, column=test_housing_gdf["median_house_value"], cmap=plt.cm.jet, legend=True)
plt.axis(False)
plt.title("Median House Value in California Districts") # Adding a title
plt.savefig("test_housing_gdf.png", dpi=600) # save the figure to a PNG file before plt.show()
plt.show()
#Let us explore our map further
test_housing_gdf.explore(column="median_house_value", cmap="jet", legend=True)
The final exploratory map effectively visualizes the predicted median house values across California districts, using a color gradient to represent the price range. It allows for interactive exploration, enabling users to zoom in and out, pan across the map, and hover over individual points to see specific details like location and predicted value. This interactive feature provides a powerful tool for understanding the geographical distribution of housing prices and identifying potential hotspots or areas of interest in the California housing market.
This project provided a comprehensive exploration of predicting median house values in California districts using machine learning. By applying a systematic approach encompassing data exploration, preparation, model training, evaluation, and visualization, we gained valuable insights into the housing market.
The Random Forest Regression model emerged as the most effective among the tested models, demonstrating promising predictive capabilities. The geographical visualization highlighted spatial patterns in housing prices, revealing potential influencing factors such as proximity to urban centers and coastal areas.
While the achieved results are encouraging, further research and model refinements could enhance predictive accuracy. This could involve hyperparameter tuning, feature engineering, or integrating external datasets to incorporate additional relevant information.
Overall, this project demonstrates the potential of machine learning for data-driven decision-making in the real estate domain. By leveraging the insights gained, stakeholders can make more informed choices regarding property valuations, investment strategies, and market analysis.
Mohammad Jawad Nayosh